Import libraries and get data

Solution from Kaggle Titanic

Data information:

Survived: Outcome of survival (0 = No; 1 = Yes)
Pclass: Socio-economic class (1 = Upper class; 2 = Middle class; 3 = Lower class)
Name: Name of passenger
Sex: Sex of the passenger
Age: Age of the passenger (Some entries contain NaN)
SibSp: Number of siblings and spouses of the passenger aboard
Parch: Number of parents and children of the passenger aboard
Ticket: Ticket number of the passenger
Fare: Fare paid by the passenger
Cabin: Cabin number of the passenger (Some entries contain NaN)
Embarked: Port of embarkation of the passenger (C = Cherbourg; Q = Queenstown; S = Southampton)



In [51]:

    
#import libraries
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

#Get data
train = pd.read_csv('TRAIN.csv')
test = pd.read_csv('TEST.csv')

Analyse data

Visualize the first 5 rows with the head() function.



In [3]:

    
# First 5 rows
train.head()









    Out[3]:

   PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S
4  5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   NaN    S

Removing unused data

Remove the "Name", "Ticket" and "Cabin" columns from both datasets (training and test)



In [4]:

    
train.drop(["Name", "Ticket", "Cabin"], axis=1, inplace=True)
test.drop(["Name", "Ticket", "Cabin"], axis=1, inplace=True)
train.head()









    Out[4]:

   PassengerId  Survived  Pclass  Sex     Age   SibSp  Parch  Fare     Embarked
0  1            0         3       male    22.0  1      0      7.2500   S
1  2            1         1       female  38.0  1      0      71.2833  C
2  3            1         3       female  26.0  0      0      7.9250   S
3  4            1         1       female  35.0  1      0      53.1000  S
4  5            0         3       male    35.0  0      0      8.0500   S

Generate one-hot (dummies) variables from categorical data

Use the 'get_dummies' function from pandas to generate the one-hot encoded columns



In [5]:

    
one_hot_train = pd.get_dummies(train)
one_hot_test = pd.get_dummies(test)

# First five rows from train dataset
one_hot_train.head()









    Out[5]:

   PassengerId  Survived  Pclass  Age   SibSp  Parch  Fare     Sex_female  Sex_male  Embarked_C  Embarked_Q  Embarked_S
0  1            0         3       22.0  1      0      7.2500   0           1         0           0           1
1  2            1         1       38.0  1      0      71.2833  1           0         1           0           0
2  3            1         3       26.0  0      0      7.9250   1           0         0           0           1
3  4            1         1       35.0  1      0      53.1000  1           0         0           0           1
4  5            0         3       35.0  0      0      8.0500   0           1         0           0           1



In [6]:

    
# First five rows from test dataset
one_hot_test.head()









    Out[6]:

   PassengerId  Pclass  Age   SibSp  Parch  Fare     Sex_female  Sex_male  Embarked_C  Embarked_Q  Embarked_S
0  892          3       34.5  0      0      7.8292   0           1         0           1           0
1  893          3       47.0  1      0      7.0000   1           0         0           0           1
2  894          2       62.0  0      0      9.6875   0           1         0           1           0
3  895          3       27.0  0      0      8.6625   0           1         0           0           1
4  896          3       22.0  1      1      12.2875  1           0         0           0           1
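
Encoding the training and test sets separately, as done here, works because both files happen to contain the same categories. A minimal sketch of one way to guard against a category that is missing from one file is shown below. The toy frames and their values are made up for illustration and are not taken from the Titanic data; with the real data the 'Survived' column would need to be set aside first, since the test set has no target.

```python
import pandas as pd

# Toy frames standing in for train/test (illustrative values only).
toy_train = pd.DataFrame({"Pclass": [1, 3, 2], "Embarked": ["C", "Q", "S"]})
toy_test = pd.DataFrame({"Pclass": [3, 2], "Embarked": ["Q", "S"]})  # no "C" rows

# Encode each frame on its own, as in the cell above.
toy_train_oh = pd.get_dummies(toy_train, dtype=int)
toy_test_oh = pd.get_dummies(toy_test, dtype=int)

# Reindex the test frame to the training columns: categories absent
# from the test set become all-zero columns, and the order matches.
toy_test_oh = toy_test_oh.reindex(columns=toy_train_oh.columns, fill_value=0)

print(list(toy_test_oh.columns))
```

After the reindex both frames share the same columns in the same order, which is what the classifier expects at prediction time.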

Checking and dealing with null values



In [7]:

    
# Visualize the null values (train)
one_hot_train.isnull().sum().sort_values(ascending=False)









    Out[7]:





Age            177
Embarked_S       0
Embarked_Q       0
Embarked_C       0
Sex_male         0
Sex_female       0
Fare             0
Parch            0
SibSp            0
Pclass           0
Survived         0
PassengerId      0
dtype: int64



In [8]:

    
# Fill the null Age values with each dataset's own mean age
one_hot_train['Age'] = one_hot_train['Age'].fillna(one_hot_train['Age'].mean())
one_hot_test['Age'] = one_hot_test['Age'].fillna(one_hot_test['Age'].mean())
one_hot_train.isnull().sum()









    Out[8]:





PassengerId    0
Survived       0
Pclass         0
Age            0
SibSp          0
Parch          0
Fare           0
Sex_female     0
Sex_male       0
Embarked_C     0
Embarked_Q     0
Embarked_S     0
dtype: int64



In [9]:

    
# Visualize the null values (test)
one_hot_test.isnull().sum().sort_values(ascending=False)









    Out[9]:





Fare           1
Embarked_S     0
Embarked_Q     0
Embarked_C     0
Sex_male       0
Sex_female     0
Parch          0
SibSp          0
Age            0
Pclass         0
PassengerId    0
dtype: int64



In [17]:

    
# Fill the null Fare value with the mean of the test-set fares
one_hot_test['Fare'] = one_hot_test['Fare'].fillna(one_hot_test['Fare'].mean())
one_hot_test.isnull().sum().sort_values(ascending=False)









    Out[17]:





Embarked_S     0
Embarked_Q     0
Embarked_C     0
Sex_male       0
Sex_female     0
Fare           0
Parch          0
SibSp          0
Age            0
Pclass         0
PassengerId    0
dtype: int64
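
The cells above fill each dataset with its own mean. A common alternative, sketched below under the assumption that the test set should be treated as unseen data, is to compute the fill value on the training set only and reuse it for the test set. This is a different choice from the one made in this notebook, and the toy frames are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frames standing in for the encoded train/test data (illustrative values).
toy_train = pd.DataFrame({"Age": [20.0, np.nan, 40.0]})
toy_test = pd.DataFrame({"Age": [np.nan, 30.0]})

# Statistic computed on the training rows only; mean() skips NaN by default.
train_age_mean = toy_train["Age"].mean()

# Apply the same training-set value to both frames.
toy_train["Age"] = toy_train["Age"].fillna(train_age_mean)
toy_test["Age"] = toy_test["Age"].fillna(train_age_mean)

print(toy_test["Age"].isnull().sum())
```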

Modeling

We are going to split the data into features and target, create the model and check its score



In [60]:

    
# Creating the feature and the target
feature = one_hot_train.drop('Survived', axis=1)
target = one_hot_train['Survived']

# Model creation
rf = RandomForestClassifier(random_state=1, criterion='gini', max_depth=10, n_estimators=50, n_jobs=-1)
rf.fit(feature, target)









    Out[60]:





RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=10, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=50, n_jobs=-1, oob_score=False, random_state=1,
            verbose=0, warm_start=False)



In [61]:

    
# Accuracy on the training data itself (optimistic, since no rows were held out)
rf.score(feature,target)









    Out[61]:





0.95398428731762064
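
The score above is measured on the same rows the model was fitted on, so it says little about how the model behaves on unseen passengers. A minimal sketch of a cross-validated estimate follows. It uses synthetic data from scikit-learn's make_classification as a stand-in for 'feature' and 'target', so the numbers it prints are not Titanic results; the hyperparameters mirror the ones used in the model cell.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the notebook's feature/target (assumption, not Titanic data).
X, y = make_classification(n_samples=200, n_features=8, random_state=1)

model = RandomForestClassifier(random_state=1, criterion='gini',
                               max_depth=10, n_estimators=50)

# 5-fold cross-validation: each fold is scored on rows the model did not see.
scores = cross_val_score(model, X, y, cv=5)

print(scores.mean(), scores.std())
```

With the real data, the same call would take 'feature' and 'target' in place of X and y.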

Generate the CSV file with the results

We will use pandas to generate the CSV file with the results so that it can be submitted to Kaggle



In [62]:

    
# Generate a pandas DataFrame with 'PassengerId' and 'Survived' columns
submission = pd.DataFrame()
submission['PassengerId'] = one_hot_test['PassengerId']
submission['Survived'] = rf.predict(one_hot_test)

# Generate the CSV file with 'to_csv' from Pandas
submission.to_csv('submission.csv', index=False)
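
Before uploading, it can help to check the frame for obvious problems. The sketch below is one possible check. The helper name check_submission is hypothetical, the toy frame is made up, and the rules it applies (the two columns written above, no missing values, unique ids, 0/1 predictions) are assumptions drawn from this notebook rather than an official Kaggle specification.

```python
import pandas as pd

def check_submission(df: pd.DataFrame) -> list:
    """Return a list of problems found (hypothetical helper, empty list means OK)."""
    problems = []
    if list(df.columns) != ["PassengerId", "Survived"]:
        problems.append("unexpected columns")
        return problems  # the remaining checks need both columns
    if df.isnull().any().any():
        problems.append("missing values")
    if df["PassengerId"].duplicated().any():
        problems.append("duplicate PassengerId")
    if not df["Survived"].isin([0, 1]).all():
        problems.append("Survived not in {0, 1}")
    return problems

# Toy stand-in for the 'submission' frame (illustrative values only).
toy_submission = pd.DataFrame({"PassengerId": [892, 893, 894],
                               "Survived": [0, 1, 0]})
print(check_submission(toy_submission))
```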


